Foreword

This project fits in a sequence of DARPA-funded projects at Caltech in the
area of VLSI design. The performance period for this project was from
September 11, 1994 until September 10, 1997. Two no-cost extensions were
granted: the first from September 11, 1997 until September 10, 1998; the
second from September 11, 1998 until March 10, 1999.

Objective

The objective of this project has been to develop a complete design
methodology for asynchronous VLSI systems, including basic circuit theory,
synthesis methods, optimization techniques, logical and electrical
simulation, CAD tools, testing, and microprocessor architectures for high
performance and low power. The motivation for this long-term effort is
that, in deep-submicron technology, asynchronous circuits offer important
advantages compared to synchronous design in terms of logical and
electrical robustness, performance, power consumption, and, most
importantly, design efficiency. These advantages are even more critical in
emerging and future technologies.

Asynchronous techniques are based on local communication among concurrent
units. Communication and synchronization among units are implemented by
handshake protocols. There is no concept of global time (no clocks are
used), and no assumptions are made about the duration of an action
(delay-insensitivity). Several advantages derive from this approach.

1. All design problems associated with clocks are eliminated.

2. Delay-insensitive (DI) circuits are very robust to variations in
physical parameters caused by fabrication, scaling, modeling errors,
metastability, or variations in temperature or voltage.

3. Asynchronous techniques have a low-power advantage due to the absence
of clocks, the locality of activity, the automatic shut-off of inactive
parts, and the absence of spurious transitions.

4. The modularity, scalability, and weak synchronization constraints of
asynchronous techniques drastically simplify the overall design of large
systems.

The specific objective of this project was to demonstrate the advantages
of asynchronous techniques for the design of both high-performance and
low-power digital VLSI systems, in particular general-purpose
microprocessors.


Approach 

The main vehicle for this demonstration has been the design of an
asynchronous clone of a commercial microprocessor, the MIPS R3000. A
performance advantage in terms of E * t^2 (the energy per instruction
times the square of the cycle time) compared to synchronous designs is
expected, which will demonstrate that an asynchronous design with a good
E * t^2 can outperform low-power synchronous designs at low voltage and
high-performance synchronous designs at high voltage.

A new generation of CAST, a suite of CAD tools for the design of
asynchronous circuits, has been developed. CAST includes logic synthesis,
performance analysis, digital, analog, and mixed simulation, and layout
tools.

Summary of Results

The main result of this research has been the design of a new method for
high-performance asynchronous circuits based on very fine pipelining. The
method has been demonstrated successfully through the design of two main
chips: an asynchronous lattice-structure digital filter and the MiniMIPS.

With 2 million transistors, the asynchronous MiniMIPS is probably the
largest VLSI chip and the fastest microprocessor successfully designed in
academia. It was found to be entirely functional on first silicon. Its
performance is about two and a half times that of commercial
microprocessors of the same type in equivalent technologies. Performance
in excess of 400 MIPS is expected in TSMC’s 0.25 µm CMOS technology, which
is remarkable given that the processor issues instructions one at a time,
while all processors in this performance range use multiple-instruction
issue.

The  Caltech  Filter 

The Caltech filter is an asynchronous circuit for a programmable FIR filter
of 1 to 80 coefficients. It was designed in 1994 to test the new design
ideas before embarking on the design of the MiniMIPS. Samples and
coefficients are 12 bits. The chip has 250K transistors in MOSIS’s HP
0.9 µm technology. Measurements show that the chips work correctly from
1 V to 5 V and from 77 K to 400 K. At 1.1 V, the chip operates at 36 Mops
and consumes 0.02 W. Let us compare it to the low-power Pleiades chip from
Berkeley. Pleiades was fabricated in 0.5 µm CMOS. According to the
information on its web page, it consumes 3 mW at 14 Mops with a word size
of 16 bits. The E * t^2 of the Caltech filter is 0.4 * 10^-24. The E * t^2
of Pleiades is 10^-24.

Adjusted for feature size and word size, the performance of the Caltech
filter is more than a factor of 10 better than that of Pleiades, even
though the Caltech filter was not designed for low power.
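
The raw E * t^2 values quoted for the two chips can be reproduced from the
stated power and throughput figures. The following is a minimal sketch,
assuming that E is the power divided by the operation rate and that t is
the reciprocal of the operation rate; the function name is illustrative
and does not come from the project. It does not attempt the adjustment for
feature size and word size, whose method is not described here.

```python
def energy_delay_squared(power_watts, ops_per_second):
    """Return E * t^2, with E = energy per operation and t = time per operation."""
    t = 1.0 / ops_per_second          # seconds per operation
    energy = power_watts * t          # joules per operation
    return energy * t * t


# Figures quoted in the text: 0.02 W at 36 Mops, and 3 mW at 14 Mops.
caltech_filter = energy_delay_squared(0.02, 36e6)
pleiades = energy_delay_squared(3e-3, 14e6)

print(f"Caltech filter: {caltech_filter:.2e}")
print(f"Pleiades:       {pleiades:.2e}")
print(f"Unadjusted ratio: {pleiades / caltech_filter:.2f}")
```

Under these assumptions the results come out close to 0.4 * 10^-24 and
10^-24, in agreement with the values above. The unadjusted ratio is well
below 10; the larger factor reported depends on the feature-size and
word-size adjustment.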

An  Asynchronous  MIPS  R3000  Microprocessor 

The asynchronous MIPS R3000 microprocessor (the “MiniMIPS”) project had
two main goals: first, investigating issues in asynchronous processor
architecture that had not yet been tackled (caches, precise exceptions,
register bypassing, branch-delay slot, and branch prediction); second,
developing new techniques for high-performance asynchronous digital VLSI.

The first prototype was fabricated in a 0.6 µm SCMOS process
(Hewlett-Packard via MOSIS). Except for the TLB and partial-word
instructions, this prototype implements the complete ISA specification of
the MIPS. With 2 million transistors, this chip is probably the largest
VLSI chip and the fastest microprocessor successfully designed in
academia. It was found to be entirely functional on first silicon.

According to extensive SPICE simulation, at 3.3 V and 75 °C, the MiniMIPS
was expected to dissipate 8 W and run at 280 MIPS. At 2.0 V and 75 °C, it
should dissipate 1 W and run at 150 MIPS. It should exceed 300 MIPS at
25 °C. The transistor count is 1.25M for the caches and 750K for the
datapath and control. The first prototype does not meet the expected
performance: this specific HP run was 20% slower than normal, and a layout
error left a long poly wire in a critical path, slowing that path down by
another 20%.

The prototype runs at 180 MIPS and consumes 4 W at 3.3 V and 25 °C. Its
performance is about two and a half times that of commercial
microprocessors of the same type in equivalent technologies. Performance
in excess of 400 MIPS is expected in TSMC’s 0.25 µm CMOS technology, which
is remarkable given that the processor issues instructions one at a time,
while all processors in this performance range use multiple-instruction
issue.

Comparing the performance of the MiniMIPS to that of existing synchronous
and asynchronous processors is not straightforward. High-performance
versions of the MIPS R3000 do not exist, and high-performance processors
fabricated in similar technologies are quite different. Furthermore, the
technologies used in current microprocessor designs are not always
directly comparable to the 0.6 µm SCMOS process (Hewlett-Packard via
MOSIS) used for the MiniMIPS. A comparison of the MiniMIPS to the Orion
R4600, the DEC StrongARM, and the AMULET2 is summarized in the table below.
By MIPS, we mean millions of R3000 or R4000 instructions per second for the
MiniMIPS and the Orion. The numbers for the AMULET and StrongARM refer to
their published Dhrystone MIPS figures.

[Figure 1 plot: power vs. instruction time]

Figure 1: Comparison of E * t^2 figures for different microprocessors. The
solid line represents the iso-E * t^2 line for the MiniMIPS in log-log
scale. The dotted line is the iso-E * t^2 line adjusted for 0.35 µm
technology. The points closer to the origin correspond to better
performance.

The graph of Figure 1 offers another way to compare performances. The
solid line is the iso-E * t^2 line of the MiniMIPS in 0.6 µm. The dotted
line is an adjustment of the MiniMIPS iso-E * t^2 line for 0.35 µm
technology. The points represent data for different processors. The scale
is a log-log scale, hence the straight lines. We have used the metric
E * t^2 because it is independent of voltage, to first order, and
therefore permits us to compare designs that operate at differing
voltages. Titac2 is an asynchronous MIPS microprocessor designed at the
University of Tokyo.
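
The first-order voltage independence of E * t^2 can be checked against the
SPICE-simulated MiniMIPS figures quoted earlier (8 W at 280 MIPS at 3.3 V,
and 1 W at 150 MIPS at 2.0 V, both at 75 °C). The sketch below assumes
that E is the power divided by the instruction rate and t is the
reciprocal of the instruction rate; the function name is illustrative.

```python
def e_t2(power_watts, instructions_per_second):
    """Return E * t^2 for a processor, from its power and instruction rate."""
    cycle_time = 1.0 / instructions_per_second        # seconds per instruction
    energy_per_instruction = power_watts * cycle_time  # joules per instruction
    return energy_per_instruction * cycle_time ** 2


# Simulated MiniMIPS operating points taken from the text.
at_3v3 = e_t2(8.0, 280e6)
at_2v0 = e_t2(1.0, 150e6)

power_ratio = 8.0 / 1.0
metric_ratio = at_3v3 / at_2v0

print(f"E*t^2 at 3.3 V: {at_3v3:.2e}")
print(f"E*t^2 at 2.0 V: {at_2v0:.2e}")
print(f"power changes by {power_ratio:.0f}x, E*t^2 by {metric_ratio:.2f}x")
```

Between these two operating points the power changes by a factor of 8,
while E * t^2 changes by roughly a quarter, which is consistent with the
metric being voltage-independent only to first order.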


High-Performance  Asynchronous  Techniques 

The performance targeted for the asynchronous MIPS required a drastic
improvement of the asynchronous pipeline mechanism. A new ultra-fine
pipeline mechanism was introduced whose high throughput and low forward
latency bring the performance of asynchronous circuits on a par with the
best clocked dynamic pipelines.

In an asynchronous system, since there is no clock signal, a pipeline
stage or functional unit needs to generate its own “completion signal” to
indicate the completion of that stage and the availability of the result
for the next stage. The generation of a completion signal takes a time
proportional to log N, where N is the number of bits of output data. Since
a critical cycle usually contains 2 completion signals, for 32 or 64 bits
of output, the performance of a pipeline including such a mechanism is
seriously limited. The completion detection problem is the cause of the
remaining skepticism against asynchronous techniques.

This problem was solved by pipelining the completion mechanism. The
datapath is decomposed into small units (say, 4 bits wide) that each
generate a completion signal through a small, constant delay. The
collection of all the unit completion signals is pipelined, so that it
does not appear on any critical cycle. The completion detection overhead
is thereby reduced to a small constant. Because it increases the amount of
pipelining, this method could increase the latency on pipeline stalls; it
therefore requires additional care in the design of the datapath and in
the reduction of the forward latency.
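
The difference between the two schemes can be illustrated by counting the
levels of a completion tree. This is a rough sketch, assuming the tree is
built from gates of a fixed fan-in (2 by default, an assumption not stated
in the text) and that delay is proportional to the number of levels; the
function name is illustrative.

```python
def completion_tree_depth(n_bits, fan_in=2):
    """Levels of fan_in-input gates needed to merge n_bits completion signals."""
    depth = 0
    covered = 1  # number of inputs a tree of the current depth can merge
    while covered < n_bits:
        covered *= fan_in
        depth += 1
    return depth


# Monolithic completion detection: depth grows with log N.
for width in (32, 64):
    print(width, "bits:", completion_tree_depth(width), "levels")

# Pipelined completion: each small unit has a fixed, small depth
# regardless of the total datapath width.
print("4-bit unit:", completion_tree_depth(4), "levels")
```

With two completion detections on a critical cycle, a full-width tree adds
twice its depth to the cycle, whereas the per-unit cost stays constant as
the datapath grows.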

In the current design style, the forward latency of a pipeline stage is
only 1/6 of the cycle time, and the cycle time is decreased by a
combination of deep pipelining and pipelined completion. The frequencies
achieved with this approach are very difficult to achieve for clocked
designs due to clock frequency limitations. For example, the latency of a
bare pipeline stage in 0.6 µm CMOS is 500 ps, which corresponds to a 2 GHz
clock frequency. Similarly, the low forward latencies we have now achieved
can be difficult to match in clocked design because of the margins
required for proper clock operation.


Other  Results 

We  developed  a  new  optimization  technique,  called  “slack  matching”,  to 
adjust  the  amount  of  buffering  present  in  the  channel.  In  a  finely  pipelined 
asynchronous  design  like  the  MIPS,  the  overall  throughput  performance  is 
critically  dependent  on  the  amount  of  slack  in  the  different  linear  and  circular 
pipelines  that  compose  the  architecture.  Adjusting  the  slack  of  pipelines  for 
performance  optimization  is  what  we  call  slack  matching. 
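
The dependence of throughput on slack can be illustrated with a common
first-order model of a ring (circular) pipeline. This model is not taken
from this report: it assumes that every stage has the same forward latency
(for data) and backward latency (for empty slots), and that throughput is
limited either by how fast the tokens circulate or by how fast the empty
slots circulate. The latency values (1 and 5 time units, chosen so that the
forward latency is 1/6 of their sum) and all names are illustrative.

```python
def ring_throughput(n_stages, n_tokens, forward_latency, backward_latency):
    """Tokens per time unit passing through any one stage of a ring pipeline."""
    if not 0 < n_tokens < n_stages:
        return 0.0  # an empty or completely full ring makes no progress
    # Each token needs n_stages * forward_latency to go once around the ring.
    data_limited = n_tokens / (n_stages * forward_latency)
    # Each empty slot needs n_stages * backward_latency to go around.
    hole_limited = (n_stages - n_tokens) / (n_stages * backward_latency)
    # No stage can run faster than its local cycle allows.
    local_limit = 1.0 / (forward_latency + backward_latency)
    return min(data_limited, hole_limited, local_limit)


# Two tokens circulating in rings with different amounts of buffering.
results = {n: ring_throughput(n, 2, 1.0, 5.0) for n in (6, 12, 24)}
for n_stages, throughput in results.items():
    print(f"{n_stages:2d} stages: {throughput:.4f} tokens per time unit")
```

In this model, too few stages starve the ring of empty slots and too many
stages make each token's round trip too long, so there is an intermediate
amount of buffering that maximizes throughput.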

We also devised necessary and sufficient conditions under which the slack
in an asynchronous circuit can be modified without affecting correctness.
The results let the designer reason about a complex pipelined processor by
treating it as an unpipelined system. The results were used to demonstrate
the correctness of new program transformations for slack adjustment used
in the design of the MIPS.

A novel processor architecture, called the Branch Processor, was invented
to address the problem of fetching instructions from memory. The
architecture includes a separate stream of instructions for controlling
program counter computation. It yields performance gains from the ability
to fetch more instructions at once, from reduced code size, and from
increased cache hit rates.

We  designed  a  new  synchronizer  which  can  be  used  to  sample  the  value 
of  a  wire  in  an  asynchronous  system.  The  circuit  functions  correctly  and  is 
guaranteed  to  produce  an  output  as  long  as  the  difference  between  Vdd  and 
GND  is  more  than  the  threshold  of  the  switching  devices. 

Finally, a systematic approach to eliminating charge sharing in
asynchronous circuits was developed. Because a signal in an asynchronous
circuit must change monotonically from one voltage level to the other,
eliminating charge sharing is critical. (The report on this result has not
yet been written.)

CAD  Tools 

A  new  suite  of  CAD  tools  was  developed  to  speed  up  the  development  of  high- 
performance  asynchronous  circuits.  At  the  electrical  simulation  level,  a  new 
analog  simulator  has  been  developed.  In  conjunction  with  the  new  analog 
simulator,  tools  have  been  developed  to  examine  automatically  the  simulation 
traces  and  to  pinpoint  circuit  problems  with  minimal  human  intervention. 
These  software  tools  are  also  able  to  speed  up  circuit  performance  tuning. 

We designed a new production-rule simulator. (Production rules are the
last level of logical design in our compilation method.) This new
simulator is currently being used to create a combined digital-analog
simulator.

We also designed a verification tool that can be used to check the
correctness of physical layout against the last level of logical design
(production rules) in our compilation method. The program is tailored for
asynchronous circuits, but can also be used (and has been used) to check
synchronous circuits.

15. Extended Event-rule Systems and Performance Analysis of
Asynchronous  Systems 

with  Tony  Lee,  in  Proceedings  of  TAU95,  ACM  International  workshop 
on  Timing  Issues  in  the  Specification  and  Synthesis  of  Digital  Systems, 
November  1995. 

16.  Low-Energy  Asynchronous  Memory  Design 

with Jose A. Tierno, in Proceedings of International Symposium on
Advanced Research in Asynchronous Circuits and Systems, IEEE Computer
Society Press, 176-185, 1994.

17.  An  Asynchronous  Pipeline  Lattice-structure  Filter 

with Uri Cummings and Andrew Lines, in Proceedings of International
Symposium on Advanced Research in Asynchronous Circuits and Systems,
IEEE Computer Society Press, 126-133, 1994.

18.  Robustness  Issues  in  Asynchronous  VLSI  Techniques 

in International Symposium on Fault-Tolerant Computing (FTCS-24)

3. Pipelined Asynchronous Cache Design, Mika Nystrom, April 1999.

Devices and Methods for Asynchronous Processing Inventors: Uri V.
Cummings; Tak K. Lee; Andrew Lines; Rajit Manohar; Alain J. Martin;
Mika Nystrom; Paul Penzes; Robert Southworth. Date Filed: 7/16/98
Serial No. 09/118,140.

2. CIT No. 2654-F Application Type: PCT Title: The Design of an
mings; Tak K. Lee; Andrew Lines; Rajit Manohar; Alain J. Martin;
J. Martin. Date Filed 9/15/97 Serial No. 60/058,995

